{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prompt2Model - Generate Deployable Models from Instructions\n",
    "\n",
    "[Prompt2Model](https://github.com/neulab/prompt2model) is a system that takes a natural language task description (like the prompts used for large language models such as ChatGPT) to train a small special-purpose model that is conducive for deployment.\n",
    "\n",
    "In this demo, we demonstrate how to use Prompt2Model to create a model that answers questions over documents, but you can adapt it to any task you like by changing the initial prompt and adjusting the following design decisions appropriately. Every place that has a comment saying `CHANGE THIS` is a variable that you can change to adapt the demo to your task.\n",
    "\n",
    "You can run the demo locally or in Colab, and a GPU is recommended. If you are running in Colab, you can run with a T4 GPU runtime with the default parameters, but if you can get an A100 GPU you can increase the batch size parameter and training/testing will be faster.\n",
    "<a href=\"https://colab.research.google.com/github/neulab/prompt2model/blob/main/colab_demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "If you have any questions or feedback, please feel free to contact us!\n",
    "\n",
    "- **Github:** open an [issue](https://github.com/neulab/prompt2model/issues) or submit a PR\n",
    "- **Discord:** join us on [discord](https://discord.gg/UCy9csEmFc)\n",
    "- **Twitter:** reach out to [@vijaytarian](https://twitter.com/vijaytarian) and [@Chenan3_Zhao](https://twitter.com/Chenan3_Zhao)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up\n",
    "\n",
    "First, start out by installing prompt2model from pypi."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install prompt2model"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "prompt2model requires that you use a base LLM to help out with various parts of the training process. The default is OpenAI's `gpt-3.5-turbo`, but you can use any model supported by [litellm](https://github.com/BerriAI/litellm). Set the appropriate API key as an environment variable. A good way to do this is to create a `.env` file with a single line, like below if you're using OpenAI.\n",
    "\n",
    "```text\n",
    "OPENAI_API_KEY=<your key here>\n",
    "```\n",
    "\n",
    "If you are using Colab, you can create this `.env` file locally, then upload it to Colab by clicking on the file folder on the left side of the screen.\n",
    "\n",
    "And then run the following command to load environment variables from your `.env` file into the running script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install python-dotenv\n",
    "import dotenv\n",
    "dotenv.load_dotenv()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can check to make sure that the key is actually imported by printing out the first few characters of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['OPENAI_API_KEY'][:3]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we specify the base model that we want to use here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.utils import api_tools\n",
    "# CHANGE THIS if you want to try a different model\n",
    "api_tools.default_api_agent = api_tools.APIAgent(model_name=\"gpt-3.5-turbo\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Specify your Prompt\n",
    "\n",
    "The most important design decision in using prompt2model is what prompt you will use to specify your task. In order to do so it is best to:\n",
    "\n",
    "1. Explain your task\n",
    "2. Provide a few examples\n",
    "\n",
    "In this demo, we will use the following prompt to specify a **question answering system**. If you want to try prompt2model on a new task, you can write a similar prompt by swapping in a new description and new examples. Note that this format is a bit flexible, so you don't have to follow this *exact* format, but it is a good starting point. You can also see our suggestions on [writing good prompts](https://github.com/neulab/prompt2model/blob/main/prompt_examples.md)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CHANGE THIS if you want to use a different prompt or tackle a different task\n",
    "prompt = \"\"\"\n",
    "Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.\n",
    "\n",
    "Here are examples with input questions and context passages, along with their expected outputs:\n",
    "\n",
    "input=\"Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.\"\n",
    "output=\"Santa Clara\"\n",
    "\n",
    "input=\"Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi).\"\n",
    "output=\"Vistula River\"\n",
    "\n",
    "input=\"Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries.\"\n",
    "output=\"Europe\"\n",
    "\"\"\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parse the Prompt\n",
    "\n",
    "Next, Prompt2Model parses out the instructions an examples from the prompt.\n",
    "We use the `InstructionParser` to do so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.prompt_parser import PromptBasedInstructionParser, TaskType\n",
    "\n",
    "prompt_spec = PromptBasedInstructionParser(task_type=TaskType.TEXT_GENERATION)\n",
    "prompt_spec.parse_from_prompt(prompt)\n",
    "print(f\"Instruction:\\n{prompt_spec.instruction}\\n\")\n",
    "print(f\"Examples:\\n{prompt_spec.examples}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieve Model\n",
    "\n",
    "First, we retrieve a base model that we will train. We can use the `DescriptionModelRetriever` to do so. We can set `model_size_limit_bytes` to set the maximum size of a model (in bytes) to retrieve. In this block we use a maximum model size of 3GB by default.\n",
    "\n",
    "This class returns a list of pretrained Hugging Face models. You can choose the first one by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.model_retriever import DescriptionModelRetriever\n",
    "\n",
    "retriever = DescriptionModelRetriever(\n",
    "    model_descriptions_index_path=\"huggingface_data/huggingface_models/model_info/\",\n",
    "    use_bm25=True,\n",
    "    use_HyDE=True,\n",
    ")\n",
    "top_model_names = retriever.retrieve(prompt_spec)\n",
    "pre_train_model_name = top_model_names[0]\n",
    "print(pre_train_model_name)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieve and Process Dataset\n",
    "\n",
    "Next, `Prompt2Model` searches for datasets on Hugging Face to try to find training datasets that may be useful for your task. Specifically, we use `DescriptionDatasetRetriever`, which looks up datasets that match the description.\n",
    "\n",
    "First we initialize the retriever. This creates the search index so it may take several minutes the first time you run it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.dataset_retriever import DescriptionDatasetRetriever\n",
    "\n",
    "retriever = DescriptionDatasetRetriever()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we retriever a list of top datasets for the current prompt (and display their basic data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "sorted_dataset_list = retriever.retrieve_top_datasets(prompt_spec)\n",
    "\n",
    "print(\"#\\tName\\tDescription\")\n",
    "for i, d in enumerate(sorted_dataset_list):\n",
    "    description_no_spaces = d.description.replace(\"\\n\", \" \")\n",
    "    print(f\"{i+1}):\\t{d.name}\\t{description_no_spaces}\")\n",
    "\n",
    "retrieved_dataset_dict = None"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If none of the datasets in the list look useful for your task, you can skip the rest of the section and we won't use any retrieved data.\n",
    "\n",
    "However, if one of the datasets looks useful, set the `retrieved_dataset_name` variable, and continue through the rest of the section. For the question answering example, we will pick the `squad` dataset to train our model, but of course you will want to change this to the dataset that you selected if you're doing a different task.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CHANGE THIS if you want to use a different retrieved dataset\n",
    "retrieved_dataset_name = \"squad\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Existing datasets on Hugging Face have many different formats, but Prompt2Model expects that a dataset should have one input and one output, both of which are strings. In order to solve this, we do a **canonicalization** step, where we convert the dataset into a format that is compatible with `Prompt2Model`.\n",
    "\n",
    "In order to do so, we examine the dataset and find the different configurations that exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datasets\n",
    "\n",
    "configs = datasets.get_dataset_config_names(retrieved_dataset_name)\n",
    "print(f\"Available dataset configs {configs}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we choose one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CHANGE THIS if you want to use a different dataset configuration\n",
    "chosen_config = \"plain_text\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we read in the dataset and print out an example. You can use this to check which columns you'd like to use for the input and output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "dataset = datasets.load_dataset(retrieved_dataset_name, chosen_config)\n",
    "if \"train\" not in dataset:\n",
    "    raise ValueError(\n",
    "        f\"Dataset {retrieved_dataset_name} does not have a train split.\"\n",
    "    )\n",
    "train_columns = dataset[\"train\"].column_names\n",
    "train_columns_formatted = \", \".join(train_columns)\n",
    "\n",
    "if len(dataset[\"train\"]) == 0:\n",
    "    raise ValueError(\n",
    "        f\"Dataset {retrieved_dataset_name} has no rows in the train split.\"\n",
    "    )\n",
    "example_rows = json.dumps(dataset[\"train\"][0], indent=4)\n",
    "\n",
    "print(f\"Loaded dataset. Example row:\\n{example_rows}\\n\")\n",
    "\n",
    "print(f\"It has these columns: {train_columns_formatted}.\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can set the following variables to the ones that we'd like to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CHANGE THIS if you want to use a different dataset configuration\n",
    "input_columns = [\"question\", \"context\"]\n",
    "output_column = \"answers\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we canonicalize the dataset, and we have properly prepared our dataset for training. We also save it to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "retrieved_dataset_dict = retriever.canonicalize_dataset_using_columns(\n",
    "    dataset, input_columns, output_column\n",
    ")\n",
    "retrieved_dataset_dict.save_to_disk(\"retrieved_dataset_dict\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Dataset\n",
    "\n",
    "Next, we generate some examples for training the model. We can use `PromptBasedDatasetGenerator` to generate these examples.\n",
    "\n",
    "Note that there are a number of hyperparameters here. These are in general good defaults, but you might want to play with them. In particular, this generates 5,000 examples, which may be expensive (roughly $5), so you could choose to generate fewer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.dataset_generator import PromptBasedDatasetGenerator, DatasetSplit\n",
    "\n",
    "dataset_generator = PromptBasedDatasetGenerator(\n",
    "    initial_temperature=0.3,\n",
    "    max_temperature=1.4,\n",
    "    responses_per_request=3,\n",
    "    max_api_calls=10000,\n",
    "    requests_per_minute=80,\n",
    ")\n",
    "generated_dataset = dataset_generator.generate_dataset_split(\n",
    "    prompt_spec, 5000, split=DatasetSplit.TRAIN\n",
    ")\n",
    "generated_dataset.save_to_disk(\"generated_dataset\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finetune the Model\n",
    "\n",
    "Next, we fine-tune the model. To do so we first combine the retrieved dataset with generated dataset and grab our train/validation, and testing splits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.dataset_processor import TextualizeProcessor\n",
    "\n",
    "text_processor = TextualizeProcessor(has_encoder=True)\n",
    "text_modified_dataset_dicts = text_processor.process_dataset_lists(\n",
    "    prompt_spec.instruction,\n",
    "    [generated_dataset, retrieved_dataset_dict[\"train\"]],\n",
    "    train_proportion=0.7,\n",
    "    val_proportion=0.1,\n",
    "    maximum_example_num={\"train\": 3500, \"val\": 500, \"test\": 1000},\n",
    ")\n",
    "train_datasets = [each[\"train\"] for each in text_modified_dataset_dicts]\n",
    "val_datasets = [each[\"val\"] for each in text_modified_dataset_dicts]\n",
    "test_datasets = [each[\"test\"] for each in text_modified_dataset_dicts]\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combine the retrieved dataset with generated dataset and use the `GenerationModelTrainer` to finetune the retrieved model. After the finetuning, we save the model and tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.model_trainer import GenerationModelTrainer\n",
    "from pathlib import Path\n",
    "\n",
    "# CHANGE THIS (e.g. to 16) if you have a larger GPU like an A100\n",
    "batch_size = 2\n",
    "\n",
    "trainer = GenerationModelTrainer(\n",
    "    pre_train_model_name,\n",
    "    has_encoder=True,\n",
    "    executor_batch_size=batch_size,\n",
    "    tokenizer_max_length=1024,\n",
    "    sequence_max_length=1280,\n",
    ")\n",
    "\n",
    "args_output_root = Path(\"result/training_output\")\n",
    "args_output_root.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "trained_model, trained_tokenizer = trainer.train_model(\n",
    "    hyperparameter_choices={\n",
    "        \"output_dir\": str(args_output_root),\n",
    "        \"save_strategy\": \"epoch\",\n",
    "        \"num_train_epochs\": 1,\n",
    "        \"per_device_train_batch_size\": batch_size,\n",
    "        \"evaluation_strategy\": \"epoch\",\n",
    "    },\n",
    "    training_datasets=train_datasets,\n",
    "    validation_datasets=val_datasets,\n",
    ")\n",
    "\n",
    "trained_model.save_pretrained(\"trained_model\")\n",
    "trained_tokenizer.save_pretrained(\"trained_tokenizer\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Try it out!\n",
    "\n",
    "Now, you can add input and use your fine-tuned model to do inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.model_executor import GenerationModelExecutor\n",
    "\n",
    "model_executor = GenerationModelExecutor(trained_model, trained_tokenizer)\n",
    "# CHANGE THIS to your own input\n",
    "input = \"Question: How many departments are within the Stinson-Remick Hall of Engineering? Context: The College of Engineering was established in 1920, however, early courses in civil and mechanical engineering were a part of the College of Science since the 1870s. Today the college, housed in the Fitzpatrick, Cushing, and Stinson-Remick Halls of Engineering, includes five departments of study – aerospace and mechanical engineering, chemical and biomolecular engineering, civil engineering and geological sciences, computer science and engineering, and electrical engineering – with eight B.S. degrees offered. Additionally, the college offers five-year dual degree programs with the Colleges of Arts and Letters and of Business awarding additional B.A. and Master of Business Administration (MBA) degrees, respectively.\"\n",
    "response = model_executor.make_single_prediction(\n",
    "    text_processor.wrap_single_input(prompt_spec.instruction, input)\n",
    ")\n",
    "print(response.prediction)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Model\n",
    "\n",
    "After the training, we can evaluate the trained model on the conbined test set with `ModelEvaluator`.\n",
    "This will output a number of metrics indicating how good the answers are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prompt2model.model_executor import GenerationModelExecutor\n",
    "from prompt2model.model_evaluator import Seq2SeqEvaluator\n",
    "\n",
    "import transformers\n",
    "import torch\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "trained_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(\n",
    "    \"trained_model\"\n",
    ").to(device)\n",
    "trained_tokenizer = transformers.AutoTokenizer.from_pretrained(\n",
    "    \"trained_tokenizer\"\n",
    ")\n",
    "\n",
    "test_dataset = datasets.concatenate_datasets(test_datasets)\n",
    "model_executor = GenerationModelExecutor(trained_model, trained_tokenizer, 1)\n",
    "t5_outputs = model_executor.make_prediction(test_dataset, \"model_input\")\n",
    "evaluator = Seq2SeqEvaluator()\n",
    "metric_values = evaluator.evaluate_model(test_dataset, \"model_output\", t5_outputs)\n",
    "print(metric_values)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final Words\n",
    "\n",
    "We hope that you found this demo useful!\n",
    "If you have any questions or feedback, please get in contact. And we would love to have community contributions!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "prompt2model",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
